# Multimodal Large Language Models

## SAIL-7B

SAIL is a single-transformer model designed for vision and language: a unified Multimodal Large Language Model (MLLM) that integrates raw pixel encoding and language decoding within one architecture.

- License: Apache-2.0
- Task: Image-to-Text
- Library: Transformers
- Publisher: ByteDance-Seed
- Downloads: 119 · Likes: 11

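Since this entry is tagged with Transformers and the Image-to-Text task, a generic captioning call might look like the sketch below. The hub path, the `trust_remote_code` flag, and pipeline support are all assumptions inferred from the listing; consult the model card for the exact loading code.

```python
# Minimal sketch of a generic Transformers image-to-text call; the model ID
# and trust_remote_code flag are assumptions, not confirmed usage for SAIL-7B.
from transformers import pipeline

captioner = pipeline(
    "image-to-text",
    model="ByteDance-Seed/SAIL-7B",  # hub path inferred from this listing
    trust_remote_code=True,          # likely needed for the custom unified architecture
    device_map="auto",
)

print(captioner("example.jpg"))      # e.g. [{"generated_text": "..."}]
```
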
## InternVL3-2B-AWQ

InternVL3-2B is an advanced Multimodal Large Language Model (MLLM) developed by OpenGVLab, with strong multimodal perception and reasoning capabilities; it supports tool use, GUI agents, industrial image analysis, 3D visual perception, and more. This checkpoint is an AWQ-quantized variant.

- License: Other
- Publisher: OpenGVLab
- Downloads: 677 · Likes: 1

## InternVL3-1B

InternVL3-1B is the 1B-parameter multimodal large language model in the InternVL3 series. It pairs the InternViT vision encoder with the Qwen2.5 language model and offers strong multimodal perception and reasoning capabilities.

- License: Other
- Publisher: FriendliAI
- Downloads: 71 · Likes: 1

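The InternVL model cards document a `trust_remote_code` chat interface, which the sketch below follows. It uses a simplified single-tile preprocessing (the official helper tiles large images dynamically); the 448-pixel tile and ImageNet normalization match the published InternVL recipe, while the upstream OpenGVLab hub path, image file, and prompt are placeholders.

```python
# Single-tile sketch of the InternVL chat API; the official recipe adds
# dynamic tiling of large images, omitted here for brevity.
import torch
import torchvision.transforms as T
from PIL import Image
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/InternVL3-1B"  # upstream repo; adjust for the mirror you use
model = AutoModel.from_pretrained(
    path, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto"
).eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

# One 448x448 tile with ImageNet normalization, per the InternVL preprocessing.
transform = T.Compose([
    T.Resize((448, 448)),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
pixel_values = (
    transform(Image.open("example.jpg").convert("RGB"))
    .unsqueeze(0)                      # batch of one tile
    .to(model.device, torch.bfloat16)
)

question = "<image>\nDescribe this image."
response = model.chat(tokenizer, pixel_values, question,
                      generation_config=dict(max_new_tokens=64))
print(response)
```
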
## Ovis2-1B-dev

Ovis2-1B is the latest member of the Ovis series of multimodal large language models (MLLMs). It focuses on structurally aligning vision and text embeddings and offers strong performance at a small scale, enhanced reasoning, video and multi-image processing, and improved multilingual OCR.

- License: Apache-2.0
- Task: Image-to-Text
- Library: Transformers · Multilingual
- Publisher: Isotr0py
- Downloads: 79 · Likes: 1

## Video-R1-7B

Video-R1-7B is a multimodal large language model built on Qwen2.5-VL-7B-Instruct and optimized for video reasoning: it understands video content and answers questions about it.

- License: Apache-2.0
- Task: Video-to-Text
- Library: Transformers · English
- Publisher: Video-R1
- Downloads: 2,129 · Likes: 9

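Because Video-R1-7B is described as built on Qwen2.5-VL-7B-Instruct, it should accept the base model's documented chat flow, sketched below. The `qwen_vl_utils` helper comes from the Qwen2.5-VL recipe; the hub path, video file, and question are assumptions.

```python
# Sketch of the Qwen2.5-VL chat flow, assuming Video-R1-7B keeps its base
# model's interface; paths and prompt are placeholders.
import torch
from qwen_vl_utils import process_vision_info  # helper from the Qwen2.5-VL recipe
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Video-R1/Video-R1-7B"  # hub path inferred from this listing
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "clip.mp4"},
        {"type": "text", "text": "What happens in this video, and why?"},
    ],
}]

# Render the chat template, extract vision inputs, and generate an answer.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

out = model.generate(**inputs, max_new_tokens=256)
answer = processor.batch_decode(
    out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```
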
## Finedefics

Finedefics is an open-source multimodal large language model (MLLM) that strengthens fine-grained visual recognition (FGVR) by incorporating object attribute descriptions.

- Task: Image-to-Text
- Publisher: StevenHH2000
- Downloads: 82 · Likes: 6

## VideoRefer-7B (Stage 2.5)

VideoRefer-7B is a multimodal model built on a video large language model and focused on spatio-temporal object understanding.

- License: Apache-2.0
- Task: Video-to-Text
- Library: Transformers · English
- Publisher: DAMO-NLP-SG
- Downloads: 20 · Likes: 2

## p-MoD-LLaVA-NeXT-7B

p-MoD is a mixture-of-depths multimodal large language model built with the progressive ratio decay method; this checkpoint supports image-to-text generation.

- License: Apache-2.0
- Task: Image-to-Text
- Publisher: MCG-NJU
- Downloads: 74 · Likes: 4

## Eagle-X5-7B

Eagle is a family of vision-centric, high-resolution multimodal large language models. It supports input resolutions of 1K and above and excels at tasks such as optical character recognition and document understanding.

- Task: Image-to-Text
- Library: Transformers
- Publisher: NVEagle
- Downloads: 918 · Likes: 26

## M3D-LaMed-Llama-2-7B

M3D is a 3D medical image analysis framework built on multimodal large language models; it comprises the M3D-Data dataset, the M3D-LaMed model family, and the M3D-Bench evaluation benchmark.

- License: Apache-2.0
- Task: Image-to-Text
- Library: Transformers
- Publisher: GoodBaiBai88
- Downloads: 209 · Likes: 2